Add exploration to compare wte and scale #670
Open
klei22 wants to merge 3 commits into ReaLLMASIC:master from
Conversation
Pull Request Overview
This PR adds configurable normalization layers for word token embeddings (WTE) and absolute position embeddings (ABS), with support for HyperSphereNorm variants including gain and scale parameters. It refactors the embedding scale initialization and improves checkpoint saving behavior.
Key changes:
- Introduces norm_variant_wte and norm_variant_abs with configurable radius, scale, and gain parameters
- Refactors HyperSphereNorm to support a const_radius_factor scaling mechanism
- Adds embedding_scale_init configuration option for custom initialization
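For orientation, the new configuration surface described above might look roughly like the sketch below. This is a hedged reconstruction: the field names mirror the options listed in this PR, but the defaults, types, and grouping are assumptions rather than a copy of gpt_conf.py.

```python
# Hypothetical sketch of the new GPTConfig fields (names taken from the PR
# description; defaults and types are illustrative assumptions).
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # ... existing fields elided ...

    # Post-embedding normalization for word token embeddings (WTE)
    norm_variant_wte: str = "none"   # e.g. "none" or a HyperSphereNorm variant
    norm_wte_radius: float = 1.0     # target hypersphere radius
    norm_wte_scale: float = 1.0      # const_radius_factor-style scale
    norm_wte_gain: float = 1.0       # initial gain (semantics assumed)

    # Post-embedding normalization for absolute position embeddings (ABS)
    norm_variant_abs: str = "none"
    norm_abs_radius: float = 1.0
    norm_abs_scale: float = 1.0
    norm_abs_gain: float = 1.0

    # HSNorm scale and explicit embedding-scale initialization
    hsnorm_scale: float = 1.0
    embedding_scale_init: float = 1.0
```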
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| variations/norm_variations.py | Refactors HyperSphereNorm to support scale factor and updates forward to apply gain parameter |
| train_args.py | Adds CLI arguments for WTE/ABS norm configurations, embedding scale init, and hsnorm_scale parameter |
| train.py | Simplifies checkpoint saving logic by removing redundant never_save_checkpoint check |
| model.py | Moves post-embedding norm instantiation to transformer ModuleDict and reorders embedding operations |
| gpt_conf.py | Adds configuration fields for WTE/ABS norm parameters and embedding scale initialization |
| explorations/norm_wte_abs_sweep.yaml | Adds experiment configuration for normalization sweep experiments |
| explorations/norm_wte_abs_embd_scale.yaml | Adds experiment configuration with embedding scale variations |
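As a rough illustration of how model.py might turn the options above into layers, here is a hedged sketch of a norm-building helper. HyperSphereNorm and variations/norm_variations.py are real names from this PR, but the helper's name (build_embedding_norm), the variant string, and the constructor keywords are assumptions for illustration only.

```python
# Hypothetical helper loosely modeled on the "helper method to build
# normalization layers" described in this PR; keyword names are assumed.
import torch.nn as nn

from variations.norm_variations import HyperSphereNorm  # real module; exact signature assumed

def build_embedding_norm(variant: str, dim: int, radius: float,
                         scale: float, gain: float) -> nn.Module:
    """Return the post-embedding norm layer for WTE or ABS embeddings."""
    if variant == "none":
        return nn.Identity()          # baseline / ablation: no normalization
    if variant == "hyperspherenorm":  # variant string assumed
        return HyperSphereNorm(dim,
                               radius_init_factor=radius,
                               const_radius_factor=scale,
                               gain=gain)
    raise ValueError(f"unknown norm variant: {variant}")
```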
radius = self.const_radius_factor * self.radius_init_factor
hypersphere_norm = x.norm(2, dim=-1, keepdim=True)
-return x / hypersphere_norm * self.radius
+return x  / hypersphere_norm * radius * self.gain
Extra space in 'return x  /' should be 'return x /' (single space).
Suggested change
-return x  / hypersphere_norm * radius * self.gain
+return x / hypersphere_norm * radius * self.gain
Co-authored-by: Copilot <[email protected]>
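For context, here is a minimal self-contained sketch of what the refactored forward above expresses: const_radius_factor acts as a fixed multiplier on the initial radius, and a learned gain is applied after the projection. Only the forward mirrors the diff hunk; the constructor arguments are assumptions, and the real class in variations/norm_variations.py also supports the learned-versus-fixed radius logic mentioned in the PR description.

```python
# Minimal sketch of the refactored HyperSphereNorm (constructor arguments are
# assumptions; the forward mirrors the diff hunk shown above).
import torch
import torch.nn as nn

class HyperSphereNorm(nn.Module):
    def __init__(self, dim: int, radius_init_factor: float = 1.0,
                 const_radius_factor: float = 1.0, gain: float = 1.0):
        super().__init__()
        self.radius_init_factor = radius_init_factor
        self.const_radius_factor = const_radius_factor
        # Learned per-dimension gain, initialized from the configured value
        self.gain = nn.Parameter(torch.full((dim,), gain))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fixed radius: constant scale factor times the initial radius factor
        radius = self.const_radius_factor * self.radius_init_factor
        # L2 norm over the feature dimension
        hypersphere_norm = x.norm(2, dim=-1, keepdim=True)
        # Project onto a hypersphere of that radius, then apply the gain
        return x / hypersphere_norm * radius * self.gain
```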
This pull request introduces new configuration options and experimental YAML files to enable fine-grained control and experimentation with normalization strategies for word token embeddings (WTE) and absolute positional embeddings in the model. The changes allow for flexible application of HyperSphereNorm (HSNorm) to these embeddings, with tunable parameters such as radius, scale, and gain, and provide infrastructure for running large sweeps and ablation studies on these normalization settings.
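The wiring this paragraph describes can be summarized with a small sketch. Names such as norm_wte, norm_abs, and embedding_scale are assumptions chosen to match the description; the actual model.py keeps the norm layers in the transformer ModuleDict and builds them via a helper.

```python
# Hedged sketch of how the post-embedding norms and embedding scale might be
# applied; layer names and the exact order of operations are assumptions.
import torch
import torch.nn as nn

def embed_with_norms(wte: nn.Embedding, wpe: nn.Embedding,
                     norm_wte: nn.Module, norm_abs: nn.Module,
                     embedding_scale: torch.Tensor,
                     idx: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    tok_emb = norm_wte(wte(idx))   # word token embeddings, e.g. HSNorm or Identity
    pos_emb = norm_abs(wpe(pos))   # absolute position embeddings, e.g. HSNorm or Identity
    # Sum and apply the embedding scale, whose value can now be set
    # explicitly via embedding_scale_init
    return (tok_emb + pos_emb) * embedding_scale
```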
The most important changes are:
1. Experimental configuration for normalization sweeps and ablations
Added norm_wte_abs_sweep.yaml and norm_wte_abs_embd_scale.yaml, which define comprehensive parameter sweeps over normalization variants, radius, scale, and gain for WTE and absolute position embeddings, including baseline and ablation runs. These files facilitate systematic experimentation (a conceptual sketch of what such a sweep enumerates appears at the end of this description). [1] [2]
2. Extended model and config to support post-embedding normalization
Extended GPTConfig in gpt_conf.py to include new attributes for WTE and absolute embedding normalization: norm_variant_wte, norm_wte_radius, norm_wte_scale, norm_wte_gain, norm_variant_abs, norm_abs_radius, norm_abs_scale, norm_abs_gain, and related parameters for HyperSphereNorm. [1] [2]
Updated train_args.py to expose these new configuration options as command-line arguments, allowing them to be set via CLI or YAML. [1] [2] [3]
3. Model logic for embedding normalization and scaling
Updated model.py to apply the specified normalization (e.g., HSNorm) to WTE and absolute position embeddings, using a helper method to build normalization layers with the correct parameters. Also improved the embedding scaling logic to allow explicit initialization. [1] [2] [3] [4] [5] [6]
4. HyperSphereNorm improvements
Refactored HyperSphereNorm in norm_variations.py to support a scale factor (hsnorm_scale), and clarified the logic for whether the radius is learned or fixed, improving flexibility for experiments. [1] [2]
5. Minor fixes and usability improvements
Simplified the checkpoint-saving logic in train.py by removing a redundant never_save_checkpoint check.
These changes collectively enable more systematic research into the effects of normalization on embedding layers, with a flexible, configurable setup for large-scale experimentation.
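To make the sweep structure in item 1 concrete, here is a conceptual sketch of the kind of grid that a norm_wte_abs_sweep.yaml-style file enumerates. This is not the repo's sweep runner or its actual YAML schema; the parameter values are illustrative placeholders.

```python
# Conceptual sketch of a WTE/ABS normalization sweep grid; the values below
# are placeholders, not the PR's actual sweep settings.
from itertools import product

variants = ["none", "hyperspherenorm"]   # applied independently to WTE and ABS
radii    = [1.0, 8.0]
scales   = [0.5, 1.0, 2.0]
gains    = [1.0]

runs = [
    dict(norm_variant_wte=vw, norm_wte_radius=r, norm_wte_scale=s, norm_wte_gain=g,
         norm_variant_abs=va, norm_abs_radius=r, norm_abs_scale=s, norm_abs_gain=g)
    for vw, va, r, s, g in product(variants, variants, radii, scales, gains)
]
print(f"{len(runs)} configurations, including no-norm baselines")
```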